acoustic model
Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices
Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest for many years. We present real-time speech recognition on smartphones and embedded systems that employs recurrent neural network (RNN) based acoustic models, RNN based language models, and beam-search decoding. The acoustic model is trained end-to-end with the connectionist temporal classification (CTC) loss. RNN implementations on embedded devices can suffer from excessive DRAM accesses because the parameter size of a neural network usually exceeds the cache capacity and the parameters are used only once per time step. To remedy this problem, we employ a multi-time-step parallelization approach that computes multiple output samples at a time with the parameters fetched from DRAM.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (2 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)
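For illustration, the memory-reuse idea behind the multi-time-step parallelization can be sketched as follows: frames are processed in blocks so that the weight matrices are fetched from DRAM once per block instead of once per frame. This is a minimal NumPy sketch assuming a plain tanh RNN cell and a hypothetical block size of 8; the paper's actual cell type and blocking factor may differ.

    # Minimal NumPy sketch of multi-time-step (blocked) RNN inference.
    # The tanh cell and block size are illustrative assumptions.
    import numpy as np

    def rnn_forward_blocked(x, W_x, W_h, b, block=8):
        """x: (T, input_dim); returns hidden states (T, hidden_dim).

        Frames are processed in blocks so that W_x and W_h are fetched
        from DRAM once per block and reused for `block` time steps,
        instead of being re-read for every single frame.
        """
        T, _ = x.shape
        H = W_h.shape[0]
        h = np.zeros(H)
        out = np.empty((T, H))
        for start in range(0, T, block):
            chunk = x[start:start + block]            # (<=block, input_dim)
            # One GEMM covers the input projection for every frame in
            # the block, so W_x is streamed from memory only once here.
            proj = chunk @ W_x.T + b                  # (<=block, H)
            # The recurrence is inherently sequential, but W_h stays
            # cache-resident while the block is processed.
            for t in range(chunk.shape[0]):
                h = np.tanh(proj[t] + W_h @ h)
                out[start + t] = h
        return out

Larger blocks amortize more DRAM traffic but require buffering more input frames before any output is produced, so the block size trades memory bandwidth against latency.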
STOPA: A Database of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution
Firc, Anton, Chhibber, Manasi, Mishra, Jagabandhu, Singh, Vishwanath Pratap, Kinnunen, Tomi, Malinka, Kamil
A key research area in deepfake speech detection is source tracing: determining the origin of synthesised utterances. The approaches may involve identifying the acoustic model (AM), vocoder model (VM), or other generation-specific parameters. However, progress is limited by the lack of a dedicated, systematically curated dataset. To address this, we introduce STOPA, a systematically varied and metadata-rich dataset for deepfake speech source tracing, covering 8 AMs, 6 VMs, and diverse parameter settings across 700k samples from 13 distinct synthesisers. Unlike existing datasets, which often feature limited variation or sparse metadata, STOPA provides a systematically controlled framework covering a broader range of generative factors, such as the choice of vocoder model, acoustic model, or pretrained weights, ensuring higher attribution reliability. This control improves attribution accuracy, aiding forensic analysis, deepfake detection, and generative model transparency.
- North America > United States (0.28)
- Europe > Finland (0.05)
- Europe > Czechia > South Moravian Region > Brno (0.04)
- Asia > Taiwan (0.04)
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (11 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
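As a rough illustration of what open-set source tracing asks of such metadata, the sketch below holds some synthesisers out of training entirely, so that an attribution system must reject them as unknown at test time rather than assign them to a seen class. The column names, synthesiser identifiers, and file name are hypothetical and are not taken from the STOPA release.

    # Illustrative open-set split over synthesiser metadata. All names
    # here are placeholders; consult the STOPA release for its schema.
    import pandas as pd

    def open_set_split(meta: pd.DataFrame, unknown_synthesisers: set):
        """Keep some synthesisers entirely out of training so the
        attribution system must flag them as 'unknown' at test time."""
        held_out = meta["synthesiser"].isin(unknown_synthesisers)
        return meta[~held_out], meta[held_out]

    # Example usage (hypothetical file and identifiers):
    # meta = pd.read_csv("stopa_metadata.csv")
    # known, unknown = open_set_split(meta, {"syn_11", "syn_12", "syn_13"})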
The Sound of Risk: A Multimodal Physics-Informed Acoustic Model for Forecasting Market Volatility and Enhancing Market Interpretability
Chen, Xiaoliang, Yu, Xin, Chang, Le, Jing, Teng, He, Jiashuai, Wang, Ze, Luo, Yangjun, Chen, Xingyu, Liang, Jiayue, Wang, Yuchen, Xie, Jiaying
Information asymmetry in financial markets, often amplified by strategically crafted corporate narratives, undermines the effectiveness of conventional textual analysis. We propose a novel multimodal framework for financial risk assessment that integrates textual sentiment with paralinguistic cues derived from executive vocal tract dynamics in earnings calls. Central to this framework is the Physics-Informed Acoustic Model (PIAM), which applies nonlinear acoustics to robustly extract emotional signatures from raw teleconference audio subject to distortions such as signal clipping. Both acoustic and textual emotional states are projected onto an interpretable three-dimensional Affective State Label (ASL) space: Tension, Stability, and Arousal. Using a dataset of 1,795 earnings calls (approximately 1,800 hours), we construct features capturing dynamic shifts in executive affect between the scripted presentation and spontaneous Q&A exchanges. Our key finding reveals a pronounced divergence in predictive capacity: while multimodal features do not forecast directional stock returns, they explain up to 43.8% of the out-of-sample variance in 30-day realized volatility. Importantly, volatility predictions are strongly driven by emotional dynamics during executive transitions from scripted to spontaneous speech, particularly reduced textual stability and heightened acoustic instability from CFOs, and significant arousal variability from CEOs. An ablation study confirms that our multimodal approach substantially outperforms a financials-only baseline, underscoring the complementary contributions of acoustic and textual modalities. By decoding latent markers of uncertainty from verifiable biometric signals, our methodology provides investors and regulators a powerful tool for enhancing market interpretability and identifying hidden corporate uncertainty.
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Speech (0.66)
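The volatility-forecasting step described above can be pictured with a small sketch: per-call shifts in the three ASL dimensions between the scripted presentation and the Q&A serve as regressors for 30-day realized volatility, scored out of sample. The ridge regression, feature layout, and placeholder data below are illustrative assumptions, not the authors' pipeline.

    # Sketch: ASL shifts between presentation and Q&A as volatility
    # regressors. Data here are random placeholders for illustration.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score

    def asl_shift_features(presentation_asl, qa_asl):
        """Both inputs: (n_calls, 3) arrays of [tension, stability, arousal]."""
        return np.hstack([qa_asl - presentation_asl, qa_asl, presentation_asl])

    rng = np.random.default_rng(0)
    pres = rng.normal(size=(1795, 3))        # placeholder ASL scores
    qa = rng.normal(size=(1795, 3))
    rv30 = rng.normal(size=1795)             # placeholder 30-day realized vol

    X = asl_shift_features(pres, qa)
    split = int(0.8 * len(X))
    model = Ridge(alpha=1.0).fit(X[:split], rv30[:split])
    print("out-of-sample R^2:", r2_score(rv30[split:], model.predict(X[split:])))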
Segmentation-free Goodness of Pronunciation
Cao, Xinwei, Fan, Zijian, Svendsen, Torbjørn, Salvi, Giampiero
Mispronunciation detection and diagnosis (MDD) is a significant part of modern computer-aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) that requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility of using modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that addresses potential numerical issues, and a normalization that makes the method applicable to acoustic models with different degrees of peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets, comparing the different definitions of our methods and estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies on the Speechocean762 data, showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
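For context, the classical segmentation-based GOP that the paper moves away from can be sketched as below: the score is the average log posterior of the canonical phoneme over its pre-segmented frames, relative to the best competing phoneme. GOP-AF removes the need for the start/end segment by marginalizing over all alignments of the target phoneme under the CTC model; that marginalization and the paper's normalization are not reproduced here, and the shapes below are illustrative assumptions.

    # Classical (segmentation-based) GOP from frame-wise log posteriors,
    # shown for contrast with the segmentation-free variants above.
    import numpy as np

    def gop_segment(log_posteriors, start, end, target_phoneme):
        """log_posteriors: (T, n_phonemes) frame-wise log posteriors.
        start/end delimit the pre-segmented target phoneme."""
        seg = log_posteriors[start:end]                 # (D, n_phonemes)
        target = seg[:, target_phoneme].mean()          # avg log P(p | o_t)
        best_competitor = seg.max(axis=1).mean()        # avg best phoneme score
        return target - best_competitor                 # <= 0; closer to 0 is better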
BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing
Kawamura, Masaya, Hasumi, Takuya, Shirahata, Yuma, Yamamoto, Ryuichi
This paper proposes a highly compact, lightweight text-to-speech (TTS) model for on-device applications. To reduce the model size, the proposed model introduces two techniques. First, we introduce quantization-aware training (QAT), which quantizes model parameters during training to as low as 1.58 bits. In this case, most of the 32-bit model parameters are quantized to the ternary values {-1, 0, 1}. Second, we propose a method named weight indexing, in which a group of 1.58-bit weights is saved as a single int8 index. This allows for efficient storage of model parameters even on hardware that handles values in 8-bit units. Experimental results demonstrate that the proposed method achieves an 83% reduction in model size while outperforming an unquantized baseline of similar size in synthesis quality.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
- Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
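The weight-indexing idea can be made concrete with a small packing sketch: since 3^5 = 243 fits in one byte, five ternary weights can be base-3 encoded into a single int8 index, i.e. roughly 1.6 bits per weight for the quantized layers. The group size of five is an assumption chosen here to match 8-bit storage; the paper's actual grouping may differ.

    # Sketch of weight indexing: groups of five ternary weights are
    # base-3 encoded into one byte. Group size 5 is an assumption.
    import numpy as np

    def pack_ternary(weights):
        """weights: 1-D array with values in {-1, 0, 1}, length a multiple of 5."""
        digits = (weights.reshape(-1, 5) + 1).astype(np.uint8)    # map to {0, 1, 2}
        powers = 3 ** np.arange(5, dtype=np.uint8)                # base-3 place values
        return (digits * powers).sum(axis=1).astype(np.uint8)     # one index per group

    def unpack_ternary(indices):
        digits = (indices[:, None] // 3 ** np.arange(5)) % 3      # recover base-3 digits
        return (digits.astype(np.int8) - 1).reshape(-1)           # back to {-1, 0, 1}

    w = np.array([-1, 0, 1, 1, 0, 0, -1, -1, 1, 0], dtype=np.int8)
    assert np.array_equal(unpack_ternary(pack_ternary(w)), w)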
Multilingual MFA: Forced Alignment on Low-Resource Related Languages
Tosolini, Alessio, Bowern, Claire
We compare the outcomes of multilingual and crosslingual training for related and unrelated Australian languages with similar phonological inventories. We use the Montreal Forced Aligner to train acoustic models from scratch and to adapt a large English model, evaluating results on seen data, on unseen data from seen languages, and on unseen data from unseen languages. Results indicate benefits of adapting the English baseline model for previously unseen languages.
- North America > Canada > Quebec > Montreal (0.25)
- North America > United States > Hawaii (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > France (0.04)
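For readers unfamiliar with the tool, the two workflows compared above roughly correspond to the following Montreal Forced Aligner command-line steps, driven here from Python. The corpus, dictionary, and output names are placeholders, and the exact arguments should be verified against the MFA documentation for the installed version.

    # Rough sketch of the compared workflows: (a) train an acoustic
    # model from scratch, (b) adapt a pretrained English model, then
    # align with the resulting model. Paths and names are placeholders.
    import subprocess

    corpus, dictionary = "corpus_dir", "lexicon.dict"

    # (a) train a language-specific acoustic model from scratch
    subprocess.run(["mfa", "train", corpus, dictionary, "from_scratch.zip"], check=True)

    # (b) adapt a large pretrained English model to the target language
    subprocess.run(["mfa", "model", "download", "acoustic", "english_mfa"], check=True)
    subprocess.run(["mfa", "adapt", corpus, dictionary, "english_mfa", "adapted.zip"], check=True)

    # align with either model and compare the resulting TextGrids
    subprocess.run(["mfa", "align", corpus, dictionary, "adapted.zip", "aligned_out"], check=True)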